Method for bidirectional sequencing

ABSTRACT

Described herein is a method of sequencing, comprising: splitting an asymmetrically tagged library into a plurality of sub-samples, tagging the adaptor-ligated DNA in the sub-samples with sequence tags that identify the sub-samples, optionally pooling the sub-samples, sequencing polynucleotides from each of the tagged sub-samples, or copies of the same, to produce sequence reads each comprising: i. a sub-sample identifier sequence and ii. the sequence of at least part of a fragment in the sample, wherein some of the sequence reads are derived from the top strand of a fragment in the sample and some of the sequence reads are derived from the bottom strand of the same fragment.

CROSS-REFERENCING

This application claims the benefit of UK patent application serial no. 1515557.5, which application is incorporated by reference herein.

BACKGROUND

Many diseases are caused by somatic mutations. Because somatic mutations only occur in a fraction of the cells in the body, they can be very difficult to detect by next generation sequencing (NGS). One problem is that every library preparation method and sequencing platform results in sequence reads that contain errors, e.g., PCR errors and sequencing errors. While it is sometimes possible to correct systematic errors (e.g., those that are correlated with known parameters including sequencing cycle-number, strand, sequence-context and base substitution probabilities), it is often impossible to figure out with any certainty whether a variation in a sequence is caused by an error or if it is a “real” mutation. This problem is exacerbated if the amount of sample is limited and mutation-containing polynucleotides are present only at relatively low levels, e.g., less than 5%, in the sample. For example, if a sample contains only one copy of a mutation-containing polynucleotide in a background of hundreds of polynucleotides that are otherwise identical to the mutation-containing polynucleotide except that they do not contain the mutation, then, after those polynucleotides have been sequenced, it is often impossible to tell whether the variation (which may only be observed in about 1/100 of the sequence reads) is an error that occurred during amplification or sequencing. Thus, somatic mutations that cause diseases can be extremely difficult to detect with any certainty.

Schmitt et al (Proc. Natl. Acad. Sci. 2012 109: 14508-14513) proposed a solution that involves tagging a sample with custom Y-shaped adaptors. The Y-shaped adaptors were generated by first introducing a single-stranded randomized nucleotide sequence into the stem region of one adaptor strand. A second adaptor strand is then extended using a DNA polymerase to generate a Y-shaped adaptor where the stem region has a complementary double-stranded randomized nucleotide sequence. Adaptors are then tailed by adding a 3′ base overhang using a DNA polymerase. The overhang assists in adaptor ligation to fragmented DNA tailed with a complementary base. Adaptors are ligated to fragmented DNA before the library is PCR amplified, using primers that hybridize to the single stranded (non-stem) region of the adaptors. Tag sequences allow reads deriving from the top strands of fragmented DNA to be discriminated from sequence reads derived from bottom strands. This requires paired-end sequencing and comparison of both read 1 and read 2 tag sequences. Schmitt's method is based on the idea that for a true DNA mutation, complementary substitutions should be present on both strands and, as such, a mutation can only be called with confidence if it is present in sequences from both strands.

While useful for discriminating between sequence reads from top and bottom strands, Schmitt's method has several limitations.

First, because sequence tags are random it can be difficult to identify tags that have been ‘mutated’ due to PCR or sequencing errors. In addition, it is difficult to detect errors that occurred during oligonucleotide synthesis, such as n−1 deletions.

Second, manufacture of the double-stranded adaptors is complex and expensive. Typically, Y-shaped adaptors are manufactured by annealing two oligonucleotides. In contrast, Schmitt's method also includes incorporation of random bases into the oligonucleotide, DNA polymerase extension and tailing different 3′ bases onto adaptors and fragmented DNA. These steps can be inefficient and require additional purification and quality control checks.

Third, it is difficult to control the relative incorporation of different bases in the degenerate tag sequence during oligonucleotide synthesis. This can result in some tag sequences being present at higher levels than others in the pool of Y-shaped adaptors, which reduces the probability that a fragment is tagged with unique tag sequences.

Fourth, because the tags are attached to the template in bulk, a number of tag sequences are required to reduce the probability of different fragments being attached to adaptors containing the same tag sequences and to improve the chance of detecting a PCR and/or sequencing error that results in one tag being ‘mutated’ into another. As a result, the tag sequences are typically relatively long; for example, Schmitt use 12 nucleotide tag sequences. However, long runs of random bases are likely to form intra- and inter-molecular hybrids, which can cause problems for downstream applications such as in-solution hybridization. In this application, adaptor and index sequences are ‘masked’ to reduce the effect of inter-molecular hybridisation by including blocking oligonucleotides in the hybridization. However, masking the degenerate region of the tag requires incorporation of ‘universal’ bases such as inosine, with associated additional costs. In addition, tag sequences use up a proportion of each sequencing read thereby reducing the sequence data from target fragment(s). This effect is increased if tag sequences are longer.

Fifth, in the Schmitt protocol the PCR is performed on a pool of molecules, each tagged with different 5′ and 3′ random tag sequences. Failure to remove residual adaptors can result in hybridization of an adaptor strand to a template molecule during PCR, which can inadvertently tag a molecule with a different tag sequence.

Finally, Schmitt tag DNA fragments with both 5′ and 3′ random tag sequences. If, by chance, the 5′ and 3′ tag sequences are complementary, or partially complementary, then the tags can intra-molecularly hybridize resulting in suppression of PCR amplification. This can result in uneven amplification of template molecules, depending on their 5′ and 3′ tag combination.

The present disclosure provides an alternative, better way of tagging DNA molecules such that the sequence reads from top and bottom strands can be discriminated.

SUMMARY

Described herein is a method of sequencing. In some embodiments the method may comprise: splitting an asymmetrically tagged library into a plurality of sub-samples, tagging the adaptor-ligated DNA in the sub-samples with sequence tags that identify the sub-samples, optionally pooling the sub-samples, sequencing polynucleotides from each of the tagged sub-samples, or copies of the same, to produce sequence reads each comprising: (i) a sub-sample identifier sequence and (ii) the sequence of at least part of a fragment in the sample, where at least some of the sequence reads are derived from the top strand of a fragment in the sample and some of the sequence reads are derived from the bottom strand of the same fragment.

As will be described in greater detail below (and illustrated in FIG. 1), the method provides sequence reads in which sequence reads that are derived from different strands (i.e., the top and bottom strands) of the same fragment (i.e., the same original double-stranded molecule) can be distinguished. In addition, the method provides a way to distinguish sequence reads that are derived from fragments that are otherwise identical. These features allow one to identify sequence variation with exceptional confidence.
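
By way of illustration only, the following Python sketch shows one way in which base calls from the two strands of the same fragment might be compared before a variant is accepted. It assumes that the reads have already been assigned to the top or bottom strand of one fragment and that all base calls at the position of interest are expressed relative to the same reference strand. The function name duplex_call and the min_fraction threshold are hypothetical and are not part of the method as described.

```python
from collections import Counter
from typing import Optional, Sequence


def duplex_call(top_bases: Sequence[str],
                bottom_bases: Sequence[str],
                min_fraction: float = 0.9) -> Optional[str]:
    """Return a base only if reads from BOTH strands agree on it.

    top_bases / bottom_bases: base calls at one position from reads derived
    from the top and bottom strand of the same fragment, already expressed
    on the same reference strand (an assumption of this sketch).
    min_fraction: hypothetical per-strand agreement threshold.
    """
    def strand_consensus(bases: Sequence[str]) -> Optional[str]:
        if not bases:
            return None  # no reads from this strand, so no support
        base, count = Counter(bases).most_common(1)[0]
        return base if count / len(bases) >= min_fraction else None

    top = strand_consensus(top_bases)
    bottom = strand_consensus(bottom_bases)
    # A call is made only when both strands give the same consensus base.
    if top is None or bottom is None or top != bottom:
        return None
    return top
```

In this sketch, a substitution that appears only in reads from one strand, as would be expected for many PCR or sequencing errors, yields no call.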

The method finds particular use in analyzing samples of DNA in which the amount of DNA, or diversity of fragmentation breakpoints, is limited and/or that contain fragments having a low copy number mutation (e.g., a sequence caused by a mutation that is present at low copy number relative to sequences that do not contain the mutation). These features are often present in patient samples that can be obtained non-invasively, e.g., circulating tumor DNA (ctDNA) samples, which can be obtained from peripheral blood, or invasively, e.g., tissue sections. In such samples, the mutant sequences may only be present at a very limited copy number (e.g., less than 10, less than 5 copies or even 1 copy in a background of hundreds or thousands of copies of the wild type sequence) and there is a high probability that at least some of the mutant fragments have an otherwise identical sequence (including identical ends) to a wild type fragment. In these situations, it can be almost impossible to identify a sequence variation with significant confidence.

The present method, because it involves splitting, and tagging, the sample after it is tagged with a “generic” asymmetric adaptor, has multiple advantages over the prior methods, e.g., the Schmitt method summarized above. For example, sequence tags used for each sub-sample can be error-correctable rather than random sequences. Therefore errors in the tag sequences can be recovered, retaining many reads that would otherwise be misassigned or rejected by the analysis pipeline. Second, manufacture of adaptors in the present method is more straightforward than the method used by Schmitt. Adaptors can be manufactured by annealing oligonucleotides thereby avoiding enzymatic steps and subsequent purification and quality control steps. Third, unlike degenerate oligonucleotide synthesis, non-degenerate oligonucleotides do not have synthesis biases. Fourth, because tag sequences can be designed, rather than random, they can be of specified length and sequence composition. For example, this can allow design of specific blocking sequences, tailoring the length of tags to a specific sequencing platform or avoiding secondary structure(s). Fifth, generic adaptors do not include tag sequence(s) therefore residual adaptor strands cannot inadvertently hybridize to a molecule and add a different tag sequence. Sixth, tag combinations can be rationally designed rather than random combinations of different 5′ and 3′ random tag sequences. Therefore tag combinations can be chosen to minimize uneven amplification of template molecules.
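
To illustrate what is meant by an error-correctable tag, the following Python sketch corrects an observed tag to the nearest member of a designed tag set. The four 6-nucleotide tags shown are hypothetical examples chosen so that every pair differs at all six positions, and the names hamming and correct_tag and the max_mismatches parameter are likewise assumptions made for this sketch. Nearest-tag matching by Hamming distance is only one of several possible correction schemes and is not asserted to be the scheme used in any particular embodiment.

```python
from typing import Optional, Sequence

# Hypothetical designed tag set (illustrative only); every pair of tags
# differs at all 6 positions, so a single substitution is unambiguous.
TAGS = ("ACGTAC", "CATGCA", "GTACGT", "TGCATG")


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def correct_tag(observed: str,
                tags: Sequence[str] = TAGS,
                max_mismatches: int = 1) -> Optional[str]:
    """Return the unique designed tag within max_mismatches of the observed
    tag, or None if there is no such tag or more than one."""
    candidates = [t for t in tags
                  if len(t) == len(observed)
                  and hamming(observed, t) <= max_mismatches]
    return candidates[0] if len(candidates) == 1 else None
```

With a random tag, by contrast, there is no reference set against which an observed tag can be checked, which is the first limitation of the Schmitt method noted above.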

BRIEF DESCRIPTION OF THE DRAWINGS

The invention is best understood from the following detailed description when read in conjunction with the accompanying drawings. It is emphasized that, according to common practice, the various features of the drawings are not to scale. Indeed, the dimensions of the various features are arbitrarily expanded or reduced for clarity. Included in the drawings are the following figures.

FIG. 1 schematically illustrates some of the principles of the present method.

FIG. 2 shows an overview of the workflow of one implementation of the present method. (P5 and P7 are Illumina flow cell sequences. i1 is index 1. R1 and R2 are read 1 and read 2 sequences. Prime indicates a sequence is reverse complemented. The 3′ end of a strand is indicated by an arrow. a, b, c, d represent regions of a genomic fragment.)

FIG. 3 illustrates a split-pool method. Two double-stranded library molecules, indicated by dashed lines, have the same 5′ and 3′ breakpoints. The library is split between multiple labelling reactions (barcodes 1, 2, 3 and 4 at the 5′ and 3′ ends of library molecules). Labelled reactions are then pooled. The two dashed line library molecules are associated with different barcodes so, even though they have the same 5′ and 3′ breakpoints, individual molecules can be identified.
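
The principle shown in FIG. 3 can be sketched in Python as follows. The sketch groups sequence reads by breakpoints and sub-sample barcode, so that two molecules with identical breakpoints fall into different groups when they were labelled in different reactions. The Read record, its field names and both function names are hypothetical. The collision_probability function further assumes that molecules are distributed uniformly and independently among the sub-samples, which is a simplifying assumption of this sketch and not a statement about any particular embodiment.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Read(NamedTuple):
    chrom: str    # reference sequence name
    start: int    # 5' breakpoint of the fragment
    end: int      # 3' breakpoint of the fragment
    barcode: str  # sub-sample identifier sequence


def group_by_molecule(reads: Iterable[Read]
                      ) -> Dict[Tuple[str, int, int, str], List[Read]]:
    """Group reads that share both breakpoints AND the sub-sample barcode."""
    groups = defaultdict(list)
    for r in reads:
        groups[(r.chrom, r.start, r.end, r.barcode)].append(r)
    return dict(groups)


def collision_probability(k: int, n: int) -> float:
    """Probability that at least two of k molecules with identical
    breakpoints land in the same one of n sub-samples, assuming a
    uniform and independent split (assumption of this sketch)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_all_distinct = 1.0
    for i in range(k):
        p_all_distinct *= max(n - i, 0) / n
    return 1.0 - p_all_distinct
```

Under the stated assumption, two molecules with the same breakpoints that are split among four labelling reactions, as in FIG. 3, would share a barcode with probability 1/4, and increasing the number of sub-samples reduces this probability.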

FIG. 4 shows a flow chart illustrating a bioinformatics workflow.

DEFINITIONS

Unless defined otherwise herein, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Although any methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present invention, the preferred methods and materials are described.

All patents and publications, including all sequences disclosed within such patents and publications, referred to herein are expressly incorporated by reference.

Numeric ranges are inclusive of the numbers defining the range. Unless otherwise indicated, nucleic acids are written left to right in 5′ to 3′ orientation; amino acid sequences are written left to right in amino to carboxy orientation, respectively.

The headings provided herein are not limitations of the various aspects or embodiments of the invention. Accordingly, the terms defined immediately below are more fully defined by reference to the specification as a whole.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Singleton, et al., DICTIONARY OF MICROBIOLOGY AND MOLECULAR BIOLOGY, 2D ED., John Wiley and Sons, New York (1994), and Hale & Markham, THE HARPER COLLINS DICTIONARY OF BIOLOGY, Harper Perennial, N.Y. (1991) provide one of skill with the general meaning of many of the terms used herein. Still, certain terms are defined below for the sake of clarity and ease of reference.

The term “sample” as used herein relates to a material or mixture of materials, typically containing one or more analytes of interest. In one embodiment, the term as used in its broadest sense, refers to any plant, animal or viral material containing DNA or RNA, such as, for example, tissue or fluid isolated from an individual (including without limitation plasma, serum, cerebrospinal fluid, lymph, tears, saliva and tissue sections) or from in vitro cell culture constituents, as well as samples from the environment.

The term “nucleic acid sample,” as used herein, denotes a sample containing nucleic acids. Nucleic acid samples used herein may be complex in that they contain multiple different molecules that contain sequences. Genomic DNA samples from a mammal (e.g., mouse or human) are types of complex samples. Complex samples may have more than about 10⁴, 10⁵, 10⁶ or 10⁷, 10⁸, 10⁹ or 10¹⁰ different nucleic acid molecules. A DNA target may originate from any source such as genomic DNA, or an artificial DNA construct. Any sample containing nucleic acid, e.g., genomic DNA from tissue culture cells or a sample of tissue, may be employed herein.

The term “mixture” as used herein, refers to a combination of elements, that are interspersed and not in any particular order. A mixture is heterogeneous and not spatially separable into its different constituents. Examples of mixtures of elements include a number of different elements that are dissolved in the same aqueous solution and a number of different elements attached to a solid support at random positions (i.e., in no particular order). A mixture is not addressable. To illustrate by example, an array of spatially separated surface-bound polynucleotides, as is commonly known in the art, is not a mixture of surface-bound polynucleotides because the species of surface-bound polynucleotides are spatially distinct and the array is addressable.

The term “nucleotide” is intended to include those moieties that contain not only the known purine and pyrimidine bases, but also other heterocyclic bases that have been modified. Such modifications include methylated purines or pyrimidines, acylated purines or pyrimidines, alkylated riboses or other heterocycles. In addition, the term “nucleotide” includes those moieties that contain hapten or fluorescent labels and may contain not only conventional ribose and deoxyribose sugars, but other sugars as well. Modified nucleosides or nucleotides also include modifications on the sugar moiety, e.g., wherein one or more of the hydroxyl groups are replaced with halogen atoms or aliphatic groups, or are functionalized as ethers, amines, or the like.

The terms “nucleic acid” and “polynucleotide” are used interchangeably herein to describe a polymer of any length, e.g., greater than about 2 bases, greater than about 10 bases, greater than about 100 bases, greater than about 500 bases, greater than 1000 bases, greater than 10,000 bases, greater than 100,000 bases, greater than about 1,000,000, up to about 10¹⁰ or more bases composed of nucleotides, e.g., deoxyribonucleotides or ribonucleotides, and may be produced enzymatically or synthetically (e.g., PNA as described in U.S. Pat. No. 5,948,902 and the references cited therein) which can hybridize with naturally occurring nucleic acids in a sequence specific manner analogous to that of two naturally occurring nucleic acids, e.g., can participate in Watson-Crick base pairing interactions. Naturally-occurring nucleotides include guanine, cytosine, adenine, thymine, uracil (G, C, A, T and U respectively). DNA and RNA have a deoxyribose and ribose sugar backbone, respectively, whereas PNA's backbone is composed of repeating N-(2-aminoethyl)-glycine units linked by peptide bonds. In PNA various purine and pyrimidine bases are linked to the backbone by methylenecarbonyl bonds. A locked nucleic acid (LNA), often referred to as inaccessible RNA, is a modified RNA nucleotide. The ribose moiety of an LNA nucleotide is modified with an extra bridge connecting the 2′ oxygen and 4′ carbon. The bridge “locks” the ribose in the 3′-endo (North) conformation, which is often found in the A-form duplexes. LNA nucleotides can be mixed with DNA or RNA residues in the oligonucleotide whenever desired. The term “unstructured nucleic acid,” or “UNA,” is a nucleic acid containing non-natural nucleotides that bind to each other with reduced stability. For example, an unstructured nucleic acid may contain a G′ residue and a C′ residue, where these residues correspond to non-naturally occurring forms, i.e., analogs, of G and C that base pair with each other with reduced stability, but retain an ability to base pair with naturally occurring C and G residues, respectively. Unstructured nucleic acid is described in US20050233340, which is incorporated by reference herein for disclosure of UNA.

The term “oligonucleotide” as used herein denotes a single-stranded multimer of nucleotides of from about 2 to 200 nucleotides, up to 500 nucleotides in length. Oligonucleotides may be synthetic or may be made enzymatically, and, in some embodiments, are 30 to 150 nucleotides in length. Oligonucleotides may contain ribonucleotide monomers (i.e., may be oligoribonucleotides) or deoxyribonucleotide monomers, or both ribonucleotide monomers and deoxyribonucleotide monomers. An oligonucleotide may be 10 to 20, 21 to 30, 31 to 40, 41 to 50, 51 to 60, 61 to 70, 71 to 80, 80 to 100, 100 to 150 or 150 to 200 nucleotides in length, for example.

“Primer” means an oligonucleotide, either natural or synthetic, that is capable, upon forming a duplex with a polynucleotide template, of acting as a point of initiation of nucleic acid synthesis and being extended from its 3′ end along the template so that an extended duplex is formed. The sequence of nucleotides added during the extension process is determined by the sequence of the template polynucleotide. Usually primers are extended by a DNA polymerase. Primers are generally of a length compatible with their use in synthesis of primer extension products, and are usually in the range of between 8 to 100 nucleotides in length, such as 10 to 75, 15 to 60, 15 to 40, 18 to 30, 20 to 40, 21 to 50, 22 to 45, 25 to 40, and so on. Typical primers can be in the range of between 10-50 nucleotides long, such as 15-45, 18-40, 20-30, 21-25 and so on, and any length between the stated ranges. In some embodiments, the primers are usually not more than about 10, 12, 15, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 35, 40, 45, 50, 55, 60, 65, or 70 nucleotides in length.

Primers are usually single-stranded for maximum efficiency in amplification, but may alternatively be double-stranded or partially double-stranded. If double-stranded, the primer is usually first treated to separate its strands before being used to prepare extension products. This denaturation step is typically effected by heat, but may alternatively be carried out using alkali, followed by neutralization. Also included in this definition are toehold exchange primers, as described in Zhang et al (Nature Chemistry 2012 4: 208-214), which is incorporated by reference herein.

Thus, a “primer” is complementary to a template, and complexes by hydrogen bonding or hybridization with the template to give a primer/template complex for initiation of synthesis by a polymerase, which is extended by the addition of covalently bonded bases linked at its 3′ end complementary to the template in the process of DNA synthesis.

The term “hybridization” or “hybridizes” refers to a process in which a region of a nucleic acid strand anneals to and forms a stable duplex, either a homoduplex or a heteroduplex, under normal hybridization conditions with a second complementary nucleic acid strand, and does not form a stable duplex with unrelated nucleic acid molecules under the same normal hybridization conditions. The formation of a duplex is accomplished by annealing two complementary nucleic acid strand regions in a hybridization reaction. The hybridization reaction can be made to be highly specific by adjustment of the hybridization conditions (often referred to as hybridization stringency) under which the hybridization reaction takes place, such that two nucleic acid strands will not form a stable duplex, e.g., a duplex that retains a region of double-strandedness under normal stringency conditions, unless the two nucleic acid strands contain a certain number of nucleotides in specific sequences which are substantially or completely complementary. “Normal hybridization or normal stringency conditions” are readily determined for any given hybridization reaction. See, for example, Ausubel et al., Current Protocols in Molecular Biology, John Wiley & Sons, Inc., New York, or Sambrook et al., Molecular Cloning: A Laboratory Manual, Cold Spring Harbor Laboratory Press. As used herein, the term “hybridizing” or “hybridization” refers to any process by which a strand of nucleic acid binds with a complementary strand through base pairing.

A nucleic acid is considered to be “selectively hybridizable” to a reference nucleic acid sequence if the two sequences specifically hybridize to one another under moderate to high stringency hybridization and wash conditions. Moderate and high stringency hybridization conditions are known (see, e.g., Ausubel, et al., Short Protocols in Molecular Biology, 3rd ed., Wiley & Sons 1995 and Sambrook et al., Molecular Cloning: A Laboratory Manual, Third Edition, 2001 Cold Spring Harbor, N.Y.). One example of high stringency conditions includes hybridization at about 42° C. in 50% formamide, 5×SSC, 5×Denhardt's solution, 0.5% SDS and 100 μg/ml denatured carrier DNA followed by washing two times in 2×SSC and 0.5% SDS at room temperature and two additional times in 0.1×SSC and 0.5% SDS at 42° C.

The term “duplex,” or “duplexed,” as used herein, describes two complementary polynucleotide regions that are base-paired, i.e., hybridized together.

The term “amplifying” as used herein refers to the process of synthesizing nucleic acid molecules that are complementary to one or both strands of a template nucleic acid. Amplifying a nucleic acid molecule may include denaturing the template nucleic acid, annealing primers to the template nucleic acid at a temperature that is below the melting temperatures of the primers, and enzymatically elongating from the primers to generate an amplification product. The denaturing, annealing and elongating steps each can be performed one or more times. In certain cases, the denaturing, annealing and elongating steps are performed multiple times such that the amount of amplification product is increasing, often times exponentially, although exponential amplification is not required by the present methods. Amplification typically requires the presence of deoxyribonucleoside triphosphates, a DNA polymerase enzyme and an appropriate buffer and/or co-factors for optimal activity of the polymerase enzyme. The term “amplification product” refers to the nucleic acids, which are produced from the amplifying process as defined herein.

The terms “determining,” “measuring,” “evaluating,” “assessing,” “assaying,” and “analyzing” are used interchangeably herein to refer to any form of measurement, and include determining if an element is present or not. These terms include both quantitative and/or qualitative determinations. Assessing may be relative or absolute. “Assessing the presence of” includes determining the amount of something present, as well as determining whether it is present or absent.

The term “copies of fragments” refers to the product of amplification, where a copy of a fragment can be a reverse complement of a strand of a fragment, or have the same sequence as a strand of a fragment.
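
For illustration, the relationship between a strand and its copies can be expressed in a short Python sketch. The function names reverse_complement and is_copy_of are hypothetical, and the sketch considers only exact, full-length copies written 5′ to 3′ using the letters A, C, G, T and N.

```python
# Translation table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def is_copy_of(candidate: str, strand: str) -> bool:
    """True if candidate has the same sequence as the strand or is the
    reverse complement of the strand (exact, full-length match only)."""
    return candidate == strand or candidate == reverse_complement(strand)
```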

The term “substantially identical sequences” refers to sequences that are at least 95% or at least 99% identical to one another.

The term “using” has its conventional meaning, and, as such, means employing, e.g., putting into service, a method or composition to attain an end. For example, if a program is used to create a file, a program is executed to make a file, the file usually being the output of the program. In another example, if a computer file is used, it is usually accessed, read, and the information stored in the file employed to attain an end. Similarly if a unique identifier, e.g., a barcode is used, the unique identifier is usually read to identify, for example, an object or file associated with the unique identifier.

The term “ligating,” as used herein, refers to the enzymatically catalyzed joining of the terminal nucleotide at the 5′ end of a first DNA molecule to the terminal nucleotide at the 3′ end of a second DNA molecule.

A “plurality” contains at least 2 members. In certain cases, a plurality may have at least 2, at least 5, at least 10, at least 100, at least 10,000, at least 100,000, at least 10⁶, at least 10⁷, at least 10⁸ or at least 10⁹ or more members.

If two nucleic acids are “complementary,” they hybridize with one another under high stringency conditions. The term “perfectly complementary” is used to describe a duplex in which each base of one of the nucleic acids base pairs with a complementary nucleotide in the other nucleic acid. In many cases, two sequences that are complementary have at least 10, e.g., at least 12 or 15 nucleotides of complementarity.

An “oligonucleotide binding site” refers to a site to which an oligonucleotide hybridizes in a target polynucleotide. If an oligonucleotide “provides” a binding site for a primer, then the primer may hybridize to that oligonucleotide or its complement.

The term “strand” as used herein refers to a nucleic acid made up of nucleotides linked together by covalent bonds, e.g., phosphodiester bonds. In a cell, DNA usually exists in a double-stranded form, and as such, has two complementary strands of nucleic acid referred to herein as the “top” and “bottom” strands. In certain cases, complementary strands of a chromosomal region may be referred to as “plus” and “minus” strands, the “first” and “second” strands, the “coding” and “noncoding” strands, the “Watson” and “Crick” strands or the “sense” and “antisense” strands. The assignment of a strand as being a top or bottom strand is arbitrary and does not imply any particular orientation, function or structure. The nucleotide sequences of the first strand of several exemplary mammalian chromosomal regions (e.g., BACs, assemblies, chromosomes, etc.) are known, and may be found in NCBI's Genbank database, for example.

The term “top strand,” as used herein, refers to either strand of a nucleic acid but not both strands of a nucleic acid. When an oligonucleotide or a primer binds or anneals “only to a top strand,” it binds to only one strand but not the other. The term “bottom strand,” as used herein, refers to the strand that is complementary to the “top strand.” When an oligonucleotide binds or anneals “only to one strand,” it binds to only one strand, e.g., the first or second strand, but not the other strand.

The term “extending”, as used herein, refers to the extension of a primer by the addition of nucleotides using a polymerase. If a primer that is annealed to a nucleic acid is extended, the nucleic acid acts as a template for the extension reaction.

The term “sequencing,” as used herein, refers to a method by which the identity of at least 10 consecutive nucleotides (e.g., the identity of at least 20, at least 50, at least 100 or at least 200 or more consecutive nucleotides) of a polynucleotide is obtained.

The terms “next-generation sequencing” or “high-throughput sequencing”, as used herein, refer to the so-called parallelized sequencing-by-synthesis or sequencing-by-ligation platforms currently employed by Illumina, Life Technologies, and Roche, etc. Next-generation sequencing methods may also include nanopore sequencing methods such as that commercialized by Oxford Nanopore Technologies, electronic-detection based methods such as Ion Torrent technology commercialized by Life Technologies, or single-molecule fluorescence-based methods such as that commercialized by Pacific Biosciences.

The term “bidirectional sequencing”, as used herein, refers to sequencing the top and bottom strands of an initial fragment of double stranded DNA in spatially distinct sequencing reactions, where the top and bottom sequence reads can be paired with each other and compared during data analysis. Paired-end sequencing, on the other hand, is not bidirectional sequencing because, in paired end sequencing, both ends of the sequenced amplicon are derived from only one strand of an initial fragment.

The term “asymmetric adaptor”, as used herein, refers to an adaptor that, when ligated to both ends of a double stranded nucleic acid fragment, will lead to a top strand that contains a 5′ tag sequence that is not the same as or complementary to the tag sequence at the 3′ end. Exemplary asymmetric adaptors are described in: U.S. Pat. Nos. 5,712,126 and 6,372,434 and WO/2009/032167; all of which are incorporated by reference herein in their entirety. An asymmetrically tagged fragment can be amplified by two primers: one that hybridizes to a first tag sequence added to the 3′ end of a strand, and another that hybridizes to the complement of a second tag sequence added to the 5′ end of a strand. Y-adaptors and hairpin adaptors (which can be cleaved, after ligation, to produce a “Y-adaptor”) are examples of asymmetric adaptors.

The term “Y-adaptor” refers to an adaptor that contains: a double-stranded region and a single-stranded region in which the opposing sequences are not complementary. The end of the double-stranded region can be joined to target molecules such as double-stranded fragments of genomic DNA, e.g., by ligation or a transposase-catalyzed reaction. Each strand of an adaptor-tagged double-stranded DNA that has been ligated to a Y-adaptor is asymmetrically tagged in that it has the sequence of one strand of the Y-adaptor at one end and the other strand of the Y-adaptor at the other end. Amplification of nucleic acid molecules that have been joined to Y-adaptors at both ends results in an asymmetrically tagged nucleic acid, i.e., a nucleic acid that has a 5′ end containing one tag sequence and a 3′ end that has another tag sequence.

The term “hairpin adaptor” refers to an adaptor that is in the form of a hairpin. In one embodiment, after ligation the hairpin loop can be cleaved to produce strands that have non-complementary tags on the ends. In some cases, the loop of a hairpin adaptor may contain a uracil residue, and the loop can be cleaved using uracil DNA glycosylase and endonuclease VIII, although other methods are known.

The term “adaptor-ligated sample”, as used herein, refers to a sample that has been ligated to an adaptor. As would be understood given the definitions above, a sample that has been ligated to an asymmetric adaptor contains strands that have non-complementary sequences at the 5′ and 3′ ends.

The term “splitting”, as used herein, refers to an action in which two or more portions of an initial sample are separated from one another and, e.g., placed into separate vessels. In some cases, the term “splitting” means that a sample is divided into equal parts. In other cases, the term “splitting” means that a sample is divided into unequal parts. In some cases, the term “splitting” means that several aliquots of an initial sample are removed from one vessel and placed into other vessels, where not all of an initial sample is aliquoted into the other vessels. In these cases, the portion of the initial sample that remains in the original vessel may be used as a sub-sample, as described below.

The term “a plurality of sub-samples”, as used herein, refers to the product of a sample that has been split. For example, if two portions of an initial sample are removed from one vessel and placed into two other vessels, then there are two sub-samples.

The term “tagging”, as used herein, refers to the appending of a sequence tag (that contains an identifier sequence) onto a nucleic acid molecule. A sequence tag may be added to the 5′ end, the 3′ end, or both ends of a nucleic acid molecule. A sequence tag can be added to a fragment by ligating an oligonucleotide to the fragment. In some cases, ligation to a single stranded end of a nucleic acid may be facilitated by a splint oligonucleotide, where a “splint oligonucleotide” refers to an oligonucleotide that, when hybridized to other polynucleotides, acts as a “splint” to position the polynucleotides next to one another so that they can be ligated together using, e.g., T4 DNA ligase or another ligase. In other embodiments, ligation to a single stranded end of a nucleic acid may be facilitated by a single-strand DNA ligase. A sequence tag can also be added by primer extension when the primer comprises, at its 3′ end, a sequence that binds to a sequence in the fragment and, at its 5′ end, the tag sequence.

The terms “identifier sequence” and “tag sequence that identifies” are used interchangeably herein to refer to a sequence of nucleotides used to identify and/or track the source of a polynucleotide in a reaction. After the polynucleotides in a sample are sequenced, the identifier sequence can be used to distinguish the sequence reads and/or determine from which sample a sequence read is derived. An “identifier sequence” may be referred to as a “sample barcode”, “index” or “indexer” sequence in other publications. For example, different samples (e.g., polynucleotides derived from different individuals, different tissues or cells, or polynucleotides isolated at different time points) can be tagged with identifier sequences that are different from one another and, after the samples are tagged, they are pooled. After sequencing, the source of a sequence can be tracked back to a particular sample using the identifier sequence. Identifier sequences can be added to a sample by ligation, by primer extension using a tailed primer that contains an identifier sequence in a 5′ tail, or using a transposon. An identifier sequence can range in length from 2 to 100 nucleotide bases or more and may include multiple subunits, where each different identifier has a distinct identity and/or order of subunits. A sample identifier sequence may be added to the 5′ end of a polynucleotide, the 3′ end of a polynucleotide or both the 5′ and 3′ ends of a polynucleotide, for example. In particular embodiments, a barcode sequence may have a length in the range of 1 to 36 nucleotides, e.g., from 6 to 30 nucleotides, or 8 to 20 nucleotides. In certain cases, the molecular identifier sequence may be error-correcting, meaning that even if there is an error (e.g., if the sequence of the molecular barcode is mis-synthesized, mis-read or is distorted by virtue of the various processing steps leading up to the determination of the molecular barcode sequence) then the code can still be interpreted correctly. Descriptions of exemplary error correcting sequences can be found throughout the literature (e.g., US20100323348 and US20090105959, which are both incorporated herein by reference). In some embodiments, an identifier sequence may be of relatively low complexity (e.g., may be composed of a mixture of 4 to 1024 different sequences), although higher complexity identifier sequences can be used in some cases.
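By way of non-limiting illustration, the error-correcting behavior described above can be sketched in Python as a nearest-identifier lookup. The identifier set `KNOWN`, the function names and the one-mismatch tolerance are assumptions made for this example only and are not taken from the cited publications; the sketch merely assumes that the identifiers in use differ from one another at enough positions that a single error still leaves one closest match.

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def assign_identifier(observed, known, max_mismatches=1):
    """Return the single known identifier within max_mismatches, else None."""
    hits = [
        k for k in known
        if len(k) == len(observed) and hamming(observed, k) <= max_mismatches
    ]
    # An ambiguous or absent match is left unassigned rather than guessed.
    return hits[0] if len(hits) == 1 else None


# Hypothetical identifier set; every pair differs at all six positions.
KNOWN = ["ACGTAC", "TGCATG", "GATCGA", "CTAGCT"]

print(assign_identifier("ACGTAA", KNOWN))  # one error, still resolved
print(assign_identifier("AAAAAA", KNOWN))  # too far from every identifier
```

In this sketch a read whose identifier carries one substitution is still assigned to its sub-sample or sample, whereas an identifier that is not close to exactly one known sequence is left unassigned.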

The term “sample identifier sequence” refers to a sequence of nucleotides that is appended to a target polynucleotide, where the sequence identifies the sample (e.g., which individual, which cell, which tissue, or which time point, etc.) from which a sequence read is derived. In use, each sample is tagged with a different sample identifier sequence (e.g., one sequence is appended to each sample, where the different samples are appended to different sequences), and the tagged samples can be pooled. After the samples are sequenced, the sample identifier sequence can be used to identify the source of the sequences.

The terms “sub-sample identifier sequence” and “sequence tag that identifies the sub-sample” are used interchangeably herein to refer to a sequence of nucleotides that is appended to a target polynucleotide (for example, by primer extension), where the sub-sample identifier sequence allows one to distinguish between the various sub-samples (e.g., which of the 2, 4, 6, 8, or 12 or more sub-samples) from which a sequence read is derived. In use, each sub-sample is tagged with a different sub-sample identifier sequence (e.g., one sequence is appended to each sub-sample, where the different sub-samples are appended to different sequences), or combination of sub-sample identifier sequences, and the tagged sub-samples can be optionally pooled. After sequencing, the sub-sample identifier sequences can be used to distinguish between sequences from one sub-sample and sequences from other sub-samples. As would be apparent, if “sequence tags that identify sub-samples” are used, the sequence tag appended to each sub-sample is different.

As used herein, the term “complementary” in the context of sequence reads that are complementary, refers to reads for sequences that, after the sequences have been trimmed to remove adaptor sequences, are substantially complementary to one another and, in some cases, have identical or near identical ends, indicating that the reads are derived from the same initial template molecules.

The term “opposite strands”, as used herein, refers to the top and bottom strands, where the strands are complementary to one another.

The term “potential sequence variation”, as used herein, refers to a sequence variation, e.g., a substitution, deletion, insertion or rearrangement of one or more nucleotides in one sequence relative to another.

As used herein, the term “correspond to”, with reference to a sequence read that corresponds to a particular (e.g., the top or bottom) strand of a fragment, refers to a sequence read derived from that strand or an amplification product thereof.

The term “identical or near-identical sequences”, as used herein, refers to near duplicate sequences, as measured by a similarity function, including but not limited to a Hamming distance, Levenshtein distance, Jaccard distance, cosine distance, etc. (see, generally, Kemena et al, Bioinformatics 2009 25: 2455-65). The exact threshold depends on the error rate of the sample preparation and sequencing used to perform the analysis, with higher error rates requiring lower thresholds of similarity. In certain cases, substantially identical sequences have at least 98% or at least 99% sequence identity.
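As one hedged example of such a similarity function, the following Python sketch computes percent identity from position-wise matches (i.e., from a Hamming distance) for two equal-length sequences and applies a threshold. The function names and the default threshold of 98% are choices made for this illustration; insertions and deletions would require an alignment-based measure such as a Levenshtein distance instead.

```python
def percent_identity(a, b):
    """Percent of positions that match between two equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def near_identical(a, b, min_identity=98.0):
    """True if the two sequences meet the identity threshold."""
    return percent_identity(a, b) >= min_identity


reference = "ACGT" * 25                 # 100 nucleotides
one_change = "T" + reference[1:]        # 1 substitution, 99% identity
three_changes = "TTT" + reference[3:]   # 3 substitutions, 97% identity

print(near_identical(reference, one_change))     # True
print(near_identical(reference, three_changes))  # False
```

A lower `min_identity` could be passed where the sample preparation or sequencing is known to have a higher error rate.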

The term “fragmentation breakpoint” is intended to refer to the site at which a nucleic acid fragment is joined to an adaptor. Two sequences that have the same fragmentation breakpoints have the same sequences at their ends (excluding any adaptor sequences that have been added to the fragments). Fragments can be generated by random or non-random methods. In analyzing sequence reads, the fragmentation breakpoint may be identified as the boundary between genomic-derived sequence and adaptor-derived sequences (including any overhangs in adaptor sequences).

The term “identical or near-identical fragmentation breakpoints”, as used herein, refers to two molecules that have the same 5′ end, the same 3′ end, or the same 5′ and 3′ ends, where any differences are due to a PCR error, sequencing error, mapping or alignment error or somatic mutation. A fragmentation breakpoint can be determined by removing the adaptor sequence from a sequence read, leaving the sequence of the target. The first nucleotide of the trimmed sequence represents the first nucleotide after the fragmentation breakpoint. In sequencing an amplified sample, two sequence reads that correspond to fragments that have identical or near-identical fragmentation breakpoints can be derived from the same initial fragment. In many cases, 8-30 nucleotides at the end of a trimmed sequence can be compared to the ends of other trimmed sequences to determine if the fragmentation breakpoints are the same or different. In many cases, fragmentation breakpoints can be identified after mapping reads to a reference sequence. After mapping, fragmentation breakpoints may be identified using software, e.g., Picard MarkDuplicates (available from the Broad Institute), Samtools rmdup (see, e.g., Li et al. Bioinformatics 2009, 25: 2078-2079) and BioBamBam (Tischler et al, Source Code for Biology and Medicine 2014, 9:13).
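The end-comparison approach described above can be sketched as follows. This is a minimal Python illustration that assumes exact adaptor matches and exact end matches; the adaptor strings, the function names and the choice of k=8 are hypothetical, and the sketch is not a substitute for the mapping-based tools named above.

```python
def trim_adaptors(read, adaptor5, adaptor3):
    """Remove an exact 5' adaptor prefix and 3' adaptor suffix, if present."""
    if adaptor5 and read.startswith(adaptor5):
        read = read[len(adaptor5):]
    if adaptor3 and read.endswith(adaptor3):
        read = read[:-len(adaptor3)]
    return read


def breakpoint_key(trimmed, k=8):
    """The first and last k nucleotides of a trimmed sequence."""
    return (trimmed[:k], trimmed[-k:])


def same_breakpoints(trimmed_a, trimmed_b, k=8):
    """True if both ends of the two trimmed sequences match over k nucleotides."""
    return breakpoint_key(trimmed_a, k) == breakpoint_key(trimmed_b, k)


ADAPTOR5, ADAPTOR3 = "AAAC", "GTTT"      # hypothetical adaptor sequences
insert = "ACGTACGTTTGACCAGTAGG"          # hypothetical fragment sequence
read = ADAPTOR5 + insert + ADAPTOR3

trimmed = trim_adaptors(read, ADAPTOR5, ADAPTOR3)
shifted = "T" + insert[1:]               # differs at the first nucleotide

print(same_breakpoints(trimmed, insert))   # True
print(same_breakpoints(trimmed, shifted))  # False
```

To tolerate the PCR or sequencing errors mentioned in the definition, the exact comparison in `same_breakpoints` could be relaxed to allow a small number of mismatches.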

The term “pooling”, as used herein, refers to the combining, e.g., mixing, of two samples such that the molecules within those samples become interspersed with one another in solution.

The term “pooled sample”, as used herein, refers to the product of pooling.

The term “target enrichment”, as used herein, refers to a method in which selected sequences are separated from other sequences in a sample. This may be done by hybridization to a probe, e.g., hybridizing a biotinylated oligonucleotide to the sample to produce duplexes between the oligonucleotide and the target sequence, immobilizing the duplexes via the biotin group, washing the immobilized duplexes, and then releasing the target sequences from the oligonucleotides. Alternatively, a selected sequence may be enriched by amplifying that sequence, e.g., by PCR using one or more primers that hybridize to a site that is proximal to the target sequence.

The terms “minority variant” and “sequence variation”, as used herein, refer to a variant that is present at a frequency of less than 50%, relative to other molecules in the sample. In some cases, a minority variant may be a first allele of a polymorphic target sequence, where, in a sample, the ratio of molecules that contain the first allele of the polymorphic target sequence compared to molecules that contain other alleles of the polymorphic target sequence is 1:100 or less, 1:1,000 or less, 1:10,000 or less, 1:100,000 or less or 1:1,000,000 or less.

The term “sequence diversity”, as used herein, refers to the number of 5′ and/or 3′ breakpoints that are associated with a plurality of fragments corresponding to a target sequence.

Other definitions of terms may appear throughout the specification.

It is further noted that the claims may be drafted to exclude any optional element. As such, this statement is intended to serve as antecedent basis for use of such exclusive terminology as “solely”, “only” and the like in connection with the recitation of claim elements, or the use of a “negative” limitation.

DETAILED DESCRIPTION OF THE INVENTION

Before the present invention is described, it is to be understood thatthis invention is not limited to particular embodiments described, assuch may, of course, vary. It is also to be understood that theterminology used herein is for the purpose of describing particularembodiments only, and is not intended to be limiting, since the scope ofthe present invention will be limited only by the appended claims.

Where a range of values is provided, it is understood that eachintervening value, to the tenth of the unit of the lower limit unlessthe context clearly dictates otherwise, between the upper and lowerlimits of that range is also specifically disclosed. Each smaller rangebetween any stated value or intervening value in a stated range and anyother stated or intervening value in that stated range is encompassedwithin the invention. The upper and lower limits of these smaller rangesmay independently be included or excluded in the range, and each rangewhere either, neither or both limits are included in the smaller rangesis also encompassed within the invention, subject to any specificallyexcluded limit in the stated range. Where the stated range includes oneor both of the limits, ranges excluding either or both of those includedlimits are also included in the invention.

Unless defined otherwise, all technical and scientific terms used hereinhave the same meaning as commonly understood by one of ordinary skill inthe art to which this invention belongs. Although any methods andmaterials similar or equivalent to those described herein can be used inthe practice or testing of the present invention, some potential andpreferred methods and materials are now described. All publicationsmentioned herein are incorporated herein by reference to disclose anddescribe the methods and/or materials in connection with which thepublications are cited. It is understood that the present disclosuresupersedes any disclosure of an incorporated publication to the extentthere is a contradiction.

It must be noted that as used herein and in the appended claims, thesingular forms “a”, “an”, and “the” include plural referents unless thecontext clearly dictates otherwise. Thus, for example, reference to “anucleic acid” includes a plurality of such nucleic acids and referenceto “the compound” includes reference to one or more compounds andequivalents thereof known to those skilled in the art, and so forth.

The practice of the present invention may employ, unless otherwiseindicated, conventional techniques and descriptions of organicchemistry, polymer technology, molecular biology (including recombinanttechniques), cell biology, biochemistry, and immunology, which arewithin the skill of the art. Such conventional techniques includepolymer array synthesis, hybridization, ligation, and detection ofhybridization using a label. Specific illustrations of suitabletechniques can be had by reference to the example herein below. However,other equivalent conventional procedures can, of course, also be used.Such conventional techniques and descriptions can be found in standardlaboratory manuals such as Genome Analysis: A Laboratory Manual Series(Vols. I-IV), Using Antibodies: A Laboratory Manual, Cells: A LaboratoryManual, PCR Primer: A Laboratory Manual, and Molecular Cloning: ALaboratory Manual (all from Cold Spring Harbor Laboratory Press),Stryer, L. (1995) Biochemistry (4th Ed.) Freeman, N.Y., Gait,“Oligonucleotide Synthesis: A Practical Approach” 1984, IRL Press,London, Nelson and Cox (2000), Lehninger, A., Principles of Biochemistry3^(rd) Ed., W. H. Freeman Pub., New York, N.Y. and Berg et al. (2002)Biochemistry, 5^(th) Ed., W. H. Freeman Pub., New York, N.Y., all ofwhich are herein incorporated in their entirety by reference for allpurposes.

The publications discussed herein are provided solely for theirdisclosure prior to the filing date of the present application. Nothingherein is to be construed as an admission that the present invention isnot entitled to antedate such publication by virtue of prior invention.Further, the dates of publication provided may be different from theactual publication dates which may need to be independently confirmed.

Some of the principles of the method are shown in FIG. 1. With reference to FIG. 1, the method can be initiated by ligating an asymmetric adaptor 2 to a sample that comprises fragments of DNA 4, to produce an adaptor-ligated sample 6. As shown, asymmetric adaptor 2 can be a Y adaptor, although other types of asymmetric adaptors (e.g., a hairpin adaptor) can also be used. Adaptor-ligated sample 6 contains asymmetrically-tagged fragments 8 and 10, meaning that for each strand, the 5′ end tag (A) contains a sequence that is different from and not complementary to the sequence of the 3′ end tag (B). As would be understood, depending on the sample, this step may be done by repairing the ends of the fragments in the sample, adding an “A” overhang, and then ligating the end-repaired fragments to an asymmetric adaptor that contains a “T” overhang, although other methods can be used. Next, the method comprises splitting the adaptor-ligated sample into a plurality of sub-samples 11. In this step, the adaptor-ligated sample may be split into as many sub-samples as necessary. However, in many embodiments, the adaptor-ligated sample is split into 4 to 96 different sub-samples, e.g., 4 to 16 different sub-samples. As would be readily apparent, this step may be done by removing several aliquots of the adaptor-ligated sample and placing the aliquots into separate containers.

Next, the method comprises separately tagging the adaptor-ligated DNA in the sub-samples with sequence tags that identify the sub-samples. In some embodiments, this may be done by ligation (e.g., using a splint oligonucleotide or a single-stranded ligase to facilitate ligation) or by primer extension. In some embodiments (and as illustrated in FIG. 1), the tagging may be done by amplifying the adaptor-ligated DNA in each of the sub-samples using primers that hybridize to sequences in the asymmetric adaptor (primers 16 and 18), wherein at least one of the primers has an identifier sequence, and the identifier sequence or, if two primers have identifier sequences, the combination of identifier sequences, can uniquely identify a sub-sample, to produce a tagged product, e.g., an amplification product for each sub-sample (20 and 22). As shown, both primers 16 and 18 have a sub-sample identifier sequence in their 5′ tail, where sub-sample identifier sequence m can be any suitable sequence that is unique for each sub-sample, and sequence n can be any suitable sequence that is unique for each sub-sample. Such primers contain a 5′ tail that contains the sub-sample identifier sequence and a 3′ end that hybridizes to the adaptor added to the fragments and is extended using the fragments as a template, thereby producing a copy of the fragments. Using this method, the copied fragments will contain the molecular barcode at the 5′ end. For example, in primer 16, sub-sample identifier sequence m can be w or y and, in primer 18, sub-sample identifier sequence n can be x′ (i.e., the complement of x) or z′ (i.e., the complement of z). The method will still work if only one of the primers has a sub-sample identifier sequence. The primer extension method shown in FIG. 1 can be readily adapted to ligation-based methods, e.g., by ligating on oligonucleotides that contain sub-sample identifiers, as discussed above, and amplifying the ligation products (which could be pooled together immediately after the ligation reaction) using generic primers to obtain a similar set of amplification products. Alternatively, the nucleic acid in each sub-sample could be amplified by PCR, and the PCR products of each sub-sample could be ligated to a sub-sample identifier.

This tagging step can be done by PCR, using a limited number of cycles (e.g., 4 to 20 cycles), to produce tagged products 20 and 22. As shown, sub-sample 12 is amplified using primers having tails of sequences w and x′, and sub-sample 14 is amplified using primers having tails of sequences y and z′. In tagged sub-sample 20, tagged products derived from the top strand of the molecules in sub-sample 12 will have a top strand of formula w-C-x, and tagged products derived from the bottom strand of the molecules in sub-sample 12 will have a bottom strand of formula w-C′-x (where C′ is the reverse complement of C). Likewise, in tagged sub-sample 22, tagged products derived from the top strand of the molecules in sub-sample 14 will have a top strand of formula y-C-z, and tagged products derived from the bottom strand of the molecules in sub-sample 14 will have a bottom strand of formula y-C′-z. As would be apparent, in the sets of tagged sub-samples 20 and 22, the top and bottom strand sequences are distinguishable. For example, in tagged sub-sample 20, the top strand products are of formula w-C-x, whereas the bottom strand products are of formula w-C′-x, which allows the top strand products to be distinguished from the bottom strand products after sequencing. Likewise, in tagged sub-sample 22, the top strand products are of formula y-C-z, whereas the bottom strand products are of formula y-C′-z, which allows the top strand products to be distinguished from the bottom strand products after sequencing.

Next, after an optional pooling step in which the tagged sub-samples 20 and 22 can be pooled together, the method comprises sequencing polynucleotides from each of the tagged sub-samples, or copies of the same (if those sequences are amplified), to produce sequence reads 30 each comprising: (i) a sub-sample identifier sequence (w, x, y or z or their complements) and (ii) the sequence of at least part of a fragment in the sample (c or c′). As shown, some of the sequence reads are derived from the top strand of a fragment in the sample and some of the sequence reads are derived from the bottom strand of the same fragment. In other words, at least some of the sequence reads, excluding the sub-sample identifier sequences, are substantially complementary (i.e., some of the sequence reads contain sequence c and some sequence reads contain sequence c′) and correspond to opposite strands (i.e., the top and bottom strands) of a fragment in the sample of (a). In the illustrative example shown in FIG. 1, two identical fragments in sample 4 would result in four types of sequence reads 32, 34, 36 and 38. In these reads, identical or near identical sequence reads derived from different starting molecules in sample 4 can be distinguished by their identifiers (i.e., whether the reads contain sub-sample identifiers w and x or identifiers y and z), and, for sequence reads that contain the same sub-sample identifiers, the strand from which the sequence read is derived (i.e., whether a read is derived from the top strand or the bottom strand of a fragment in sample 4) can be determined by the polarity of the sequence corresponding to that fragment (i.e., whether the sequence is sequence c or its complement c′). True mutations should be at the same positions in both strands and, as such, if the same mutation is seen in a top strand sequence (e.g., w-c-x) as well as a bottom strand sequence (e.g., w-c′-x), then a sequence variation is more likely to be true. If a mutation is caused by a PCR or a sequencing error, then it is very unlikely to appear at the same place in both strands.
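The strand assignment described above can be illustrated with a short Python sketch. It assumes, for simplicity, that a read has the form tag-insert-tag, that the tags and insert match exactly, and that the sequence c of the fragment is known; the tag sequences `W` and `X`, the fragment sequence `C` and the function names are all hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T, N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def split_read(read, tag5, tag3):
    """Return the insert between the two tags, or None if the tags are absent."""
    if (len(read) >= len(tag5) + len(tag3)
            and read.startswith(tag5) and read.endswith(tag3)):
        return read[len(tag5):len(read) - len(tag3)]
    return None


def strand_of(insert, c):
    """'top' if the insert equals c, 'bottom' if it equals c', else None."""
    if insert == c:
        return "top"
    if insert == reverse_complement(c):
        return "bottom"
    return None


W, X = "AAGG", "CCTT"   # hypothetical sub-sample identifiers w and x
C = "ACGGTCA"           # hypothetical fragment sequence c

top_read = W + C + X                         # formula w-c-x
bottom_read = W + reverse_complement(C) + X  # formula w-c'-x

print(strand_of(split_read(top_read, W, X), C))     # top
print(strand_of(split_read(bottom_read, W, X), C))  # bottom
```

In practice the polarity would more likely be inferred from the orientation in which a read maps to a reference sequence, which tolerates mismatches; the exact-match comparison here is used only to keep the example self-contained.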

The sequencing step may be done using any convenient next generation sequencing method and may result in at least 10,000, at least 50,000, at least 100,000, at least 500,000, at least 1M, at least 10M, at least 100M or at least 1B sequence reads. In some cases, the reads are paired-end reads. As would be apparent, the primers used for amplification may be compatible with use in any next generation sequencing platform in which primer extension is used, e.g., Illumina's reversible terminator method, Roche's pyrosequencing method (454), Life Technologies' sequencing by ligation (the SOLiD platform), Life Technologies' Ion Torrent platform or Pacific Biosciences' fluorescent base-cleavage method. Examples of such methods are described in the following references: Margulies et al (Nature 2005 437: 376-80); Ronaghi et al (Analytical Biochemistry 1996 242: 84-9); Shendure (Science 2005 309: 1728); Imelfort et al (Brief Bioinform. 2009 10:609-18); Fox et al (Methods Mol Biol. 2009; 553:79-108); Appleby et al (Methods Mol Biol. 2009; 513:19-39); English (PLoS One. 2012 7: e47768) and Morozova (Genomics. 2008 92:255-64), which are incorporated by reference for the general descriptions of the methods and the particular steps of the methods, including all starting products, reagents, and final products for each of the steps.

The sequence reads may be analyzed by a computer and, as such, instructions for performing the steps set forth below may be set forth as programming that may be recorded in a suitable physical computer readable storage medium. The general principles of some of the analysis steps are described below.

The sequence reads may be processed and grouped in any convenient way. In some embodiments, the sequence reads may be grouped by their sub-sample identifier (e.g., whether they contain identifier w or identifier y), and, optionally, by one or more of the fragmentation breakpoints of the sequence read, where a fragmentation breakpoint is represented by the “end” of the sequence after the tags have been trimmed off (i.e., one or more ends of sequence c or c′). Assuming fragmentation is random, or semi-random, different fragments having the same sequence in sample 4 can be distinguished by their fragmentation breakpoints and, as such, grouping the sequence reads by their fragmentation breakpoints provides a way to determine if a particular sequence (e.g., a sequence variant) is present in more than one starting molecule. In some implementations, initial processing of the sequence reads may include identification of molecular barcodes (including sample identifier sequences or sub-sample identifier sequences), and/or trimming reads to remove low quality or adaptor sequences. In addition, quality assessment metrics can be run to ensure that the dataset is of an acceptable quality.
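One possible way to implement the grouping described above is sketched below in Python. The sketch assumes that each read has already been reduced to a record holding its sub-sample identifier and the mapped start and end coordinates of the trimmed sequence; the dictionary keys and the example values are hypothetical.

```python
from collections import defaultdict


def group_reads(reads):
    """Group reads by (sub-sample identifier, 5' breakpoint, 3' breakpoint)."""
    groups = defaultdict(list)
    for read in reads:
        key = (read["subsample"], read["start"], read["end"])
        groups[key].append(read)
    return dict(groups)


# Hypothetical reads: identifiers w and y, mapped breakpoint coordinates.
reads = [
    {"name": "r1", "subsample": "w", "start": 100, "end": 250},
    {"name": "r2", "subsample": "w", "start": 100, "end": 250},
    {"name": "r3", "subsample": "y", "start": 100, "end": 250},
    {"name": "r4", "subsample": "w", "start": 120, "end": 260},
]

for key, members in group_reads(reads).items():
    print(key, [r["name"] for r in members])
```

Here r1 and r2 fall into one group and would be treated as copies of the same starting molecule, whereas r3 (different sub-sample identifier) and r4 (different breakpoints) each form their own group.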

In certain embodiments, the method may further comprise identifying a potential sequence variation in a group of sequence reads that correspond to the top strand of a fragment and determining if the potential sequence variation is in any of the sequence reads that correspond to the bottom strand of the fragment. As noted above, the confidence that a potential sequence variation is a true variation (rather than a PCR or sequencing error) increases if it is present in both strands of the same molecule in sample 4.

In some embodiments, the method may further comprise identifying identical or near-identical sequence reads that have identical or near-identical fragmentation breakpoints but different sub-sample identifier sequences. In these embodiments, sequence reads derived from two fragments that are otherwise near identical in sequence and fragmentation breakpoints can be distinguished by their sub-sample identifier sequence. As would be apparent, the confidence that a potential sequence variation is a true variation (rather than a PCR or sequencing error) increases if it is present in more than one molecule in sample 4.
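As a hedged illustration of this step, the following Python sketch reports the fragmentation breakpoints that are observed with more than one sub-sample identifier. It assumes that each read is represented by its sub-sample identifier and mapped breakpoint coordinates; the field names and values are hypothetical. Because each starting molecule ends up in only one sub-sample, breakpoints seen with two identifiers are taken here to indicate at least two starting molecules.

```python
from collections import defaultdict


def breakpoints_in_multiple_subsamples(reads):
    """Map each (start, end) seen with >1 sub-sample identifier to those identifiers."""
    by_breakpoint = defaultdict(set)
    for read in reads:
        by_breakpoint[(read["start"], read["end"])].add(read["subsample"])
    return {bp: ids for bp, ids in by_breakpoint.items() if len(ids) > 1}


# Hypothetical reads with sub-sample identifiers w and y.
reads = [
    {"subsample": "w", "start": 100, "end": 250},
    {"subsample": "w", "start": 100, "end": 250},
    {"subsample": "y", "start": 100, "end": 250},
    {"subsample": "y", "start": 300, "end": 450},
]

print(breakpoints_in_multiple_subsamples(reads))
```

Only the breakpoint pair (100, 250) is reported, since it is observed with both identifier w and identifier y.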

The ability to distinguish between: a) sequence reads that are derived from different fragments and b) sequence reads that are derived from different strands of the same fragment allows one to determine whether a sequence variation is real with more confidence.

As alluded to above, in some embodiments, the method may comprise pooling the tagged products prior to the sequencing step to produce a pooled sample. The pool can be an equi-molar or equi-volume mix of the different amplified sub-samples. Alternatively, in some embodiments, different tagged sub-samples can be mixed at different ratios or volumes. In these embodiments, the sequencing step comprises sequencing nucleic acids in the pooled sample. Alternatively, the amplification products may be sequenced without pooling. In these embodiments, they may be separately applied to the sequencing substrate and sequenced in the same sequencing run, or sequenced in separate sequencing runs.

In some embodiments, the nucleic acids sequenced in the sequencing step are enriched from the tagged samples by target enrichment, many methods for which are known. In some embodiments, the enrichment may be done by hybridization to a probe, e.g., by SURESELECT™, which may involve hybridizing the amplification products to an oligonucleotide (e.g., RNA) probe that contains an affinity tag (e.g., biotin). The resultant duplexes can be separated from other molecules by binding the oligonucleotide to a solid support and washing, and the target molecules can be released. As would be apparent, such methods may use at least two oligonucleotide probes: for example, probes that target each strand. In some embodiments, however, the enrichment step may be done using a probe that hybridizes to only one of the strands (e.g., a probe that hybridizes to either the top strand sequence c or bottom strand sequence c′). In these embodiments and with reference to FIG. 1, capturing only one strand of the tagged products using a probe that hybridizes to c will enrich for products derived from the top strand of molecule 8 (of sequence w-c-x; 5′ to 3′) as well as products that are derived from the bottom strand of the same molecule (of sequence x′-c-w′; 5′ to 3′), which can be distinguished from one another after sequencing by the orientation of c relative to w and x and their complements. Target enrichment can also be done using target-specific primers by PCR amplification (see, e.g., US20130231253). In some instances, target enrichment may be done in a single reaction or, in some cases, two reactions (one for each strand). In other cases, target enrichment may be done in the same reaction if the probes/primers are not overlapping.

In some embodiments, sample identifiers (i.e., a sequence that identifies the sample to which the sequence is added, which can identify the patient, or a tissue, etc.) can be added to the polynucleotides prior to sequencing, so that multiple (e.g., at least 2, at least 4, at least 8, at least 16, at least 48, at least 96 or more) samples can be multiplexed. In these embodiments, the sample identifier may be ligated to the initial polynucleotides as part of the asymmetric adaptor, or the sample identifier may be ligated to the polynucleotides in the sub-samples, before or after amplification of those polynucleotides. Alternatively, the tag may be added by primer extension, i.e., using a primer that has a 3′ end that hybridizes to an adaptor (e.g., the asymmetric adaptor or a tag sequence added to the sub-samples), and a 5′ tail that contains the sample identifier. For example, in some embodiments, the asymmetric adaptor may comprise a sample identifier sequence that identifies the sample to which the asymmetric adaptor is added, wherein the amplification products each comprise a sub-sample identifier sequence and a sample identifier sequence, and wherein said sequencing reads comprise the sample identifier sequence and the sub-sample identifier sequence.

The method described above can be employed to analyze genomic DNA from virtually any organism, including, but not limited to, plants, animals (e.g., reptiles, mammals, insects, worms, fish, etc.), tissue samples, bacteria, fungi (e.g., yeast), phage, viruses, cadaveric tissue, archaeological/ancient samples, etc. In certain embodiments, the genomic DNA used in the method may be derived from a mammal, wherein in certain embodiments the mammal is a human. In exemplary embodiments, the sample may contain genomic DNA from a mammalian cell, such as a human, mouse, rat, or monkey cell. The sample may be made from cultured cells or cells of a clinical sample, e.g., a tissue biopsy, scrape or lavage or cells of a forensic sample (i.e., cells of a sample collected at a crime scene). In particular embodiments, the nucleic acid sample may be obtained from a biological sample such as cells, tissues, bodily fluids, and stool. Bodily fluids of interest include but are not limited to, blood, serum, plasma, saliva, mucus, phlegm, cerebral spinal fluid, pleural fluid, tears, lactal duct fluid, lymph, sputum, synovial fluid, urine, amniotic fluid, and semen. In particular embodiments, a sample may be obtained from a subject, e.g., a human. In some embodiments, the sample comprises fragments of human genomic DNA. In some embodiments, the sample may be obtained from a cancer patient. In some embodiments, the sample may be made by extracting fragmented DNA from a patient sample, e.g., a formalin-fixed paraffin embedded tissue sample. In some embodiments, the patient sample may be a sample of cell-free “circulating” DNA from a bodily fluid, e.g., peripheral blood, e.g., from the blood of a patient or of a pregnant female. The DNA fragments used in the initial step of the method should be non-amplified DNA that has not been denatured beforehand.

The DNA in the initial sample may be made by extracting genomic DNA from a biological sample, and then fragmenting it. In these embodiments, the initial steps may be mediated by a transposase (see, e.g., Caruccio, Methods Mol. Biol. 2011; 733:241-55), in which case the fragmentation and tagging steps may be done simultaneously, i.e., in the same reaction using a process that is often referred to as “tagmentation”. In other embodiments, the fragmenting may be done mechanically (e.g., by sonication, nebulization, or shearing) or using a double stranded DNA “dsDNA” fragmentase enzyme (New England Biolabs, Ipswich, Mass.). In some of these methods (e.g., the mechanical and fragmentase methods), after the DNA is fragmented, the ends are polished and A-tailed prior to ligation to the adaptor. Alternatively, the ends may be polished and ligated to adaptors in a blunt-end ligation reaction. In other embodiments, the DNA in the initial sample may already be fragmented (e.g., as is the case for FPET samples and circulating cell-free DNA (cfDNA), e.g., ctDNA). The fragments in the initial sample may have a median size that is below 1 kb (e.g., in the range of 50 bp to 500 bp, or 80 bp to 400 bp), although fragments having a median size outside of this range may be used.

In some embodiments, the amount of DNA in a sample may be limiting. For example, the initial sample of fragmented DNA may contain less than 200 ng of fragmented human DNA, e.g., 10 pg to 200 ng, 100 pg to 200 ng, 1 ng to 200 ng or 5 ng to 50 ng, or less than 10,000 (e.g., less than 5,000, less than 1,000, less than 500, less than 100 or less than 10) haploid genome equivalents, depending on the genome.

Bioinformatics Workflow

In some embodiments, sequence reads that have near identical fragmentation breakpoints and the same sub-sample identifier, and that are substantially identical, are grouped into “read groups”. Fragments with the same fragmentation breakpoints have substantially identical sequences at positions adjacent to the adaptor. These can be identified, e.g., in paired-end reads as substantially identical 5′ sequences from both read 1 and read 2. Similarly, single-end reads that traverse both breakpoints can be used to identify sequences adjacent to both adaptor sequences. In some embodiments, reads are trimmed to remove adaptor sequences, overhang sequences for ligation and the like. For example, all of the sequence reads in a read group may have a contiguous sequence of at least 10, 15, 20 or 30 nucleotides at the 5′ end that are substantially identical to one another and a contiguous sequence of at least 10, 15, 20 or 30 nucleotides at the 3′ end that are substantially identical to one another. One way a bioinformatics workflow can be implemented is shown in FIG. 4. In this workflow, reads with the same sample barcode, sub-sample barcode and 5′ and 3′ breakpoints are grouped. Reads are then divided according to whether they are in orientation 1 or orientation 2 relative to read 1/read 2 (i.e., A-c-B vs A′-c-B′). Sequences from orientations 1 and 2 are then compared. Genuine variants are expected to be present in molecules from orientation 1 AND 2. PCR or NGS errors are only expected in orientation 1 OR 2. In some cases, the read groups that do not contain reads from both strands of the original fragment are discarded. In other embodiments, the read groups that do not contain reads from both strands of the original fragment are retained, but the sequence is given a lower probability of being correct. In some cases, an error model derived from corrected reads can be applied to these groups.
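
The grouping and orientation comparison described above can be sketched in a few lines of Python. This is a minimal illustration only, not the workflow of FIG. 4: the `Read` record, its field names and the helper functions are hypothetical, breakpoints are matched exactly rather than "near identically", and a variant is counted for an orientation when more than half of the reads in that orientation carry it (an assumed rule, chosen for simplicity).

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class Read(NamedTuple):
    """Hypothetical, already-aligned read record (field names are assumptions)."""
    sample: str          # sample barcode
    sub_sample: str      # sub-sample barcode
    start: int           # 5' breakpoint (reference coordinate)
    end: int             # 3' breakpoint (reference coordinate)
    orientation: int     # 1 or 2, insert orientation relative to read 1/read 2
    variants: frozenset  # differences from the reference, e.g. {(150, "C>T")}


def group_reads(reads):
    """Group reads by sample barcode, sub-sample barcode and both breakpoints."""
    groups = defaultdict(lambda: {1: [], 2: []})
    for read in reads:
        key = (read.sample, read.sub_sample, read.start, read.end)
        groups[key][read.orientation].append(read)
    return dict(groups)


def majority_variants(reads):
    """Variants carried by more than half of the reads of one orientation."""
    counts = Counter(v for read in reads for v in read.variants)
    return {v for v, n in counts.items() if 2 * n > len(reads)}


def compare_orientations(group):
    """Return (both_strands, one_strand) variant sets, or None if a strand is missing."""
    if not group[1] or not group[2]:
        return None
    o1 = majority_variants(group[1])
    o2 = majority_variants(group[2])
    return o1 & o2, o1 ^ o2


# Toy data: one variant on both strands, one seen on a single strand only.
true_var, pcr_err = (150, "C>T"), (200, "G>A")
reads = [
    Read("S1", "B3", 100, 260, 1, frozenset({true_var, pcr_err})),
    Read("S1", "B3", 100, 260, 1, frozenset({true_var, pcr_err})),
    Read("S1", "B3", 100, 260, 2, frozenset({true_var})),
    Read("S1", "B3", 100, 260, 2, frozenset({true_var})),
    Read("S1", "B5", 100, 260, 1, frozenset()),  # same breakpoints, other sub-sample
]
groups = group_reads(reads)
both, single = compare_orientations(groups[("S1", "B3", 100, 260)])
```

In this toy data set the variant present in both orientations ends up in `both`, the variant seen in only one orientation ends up in `single`, and the read group with reads from only one strand returns `None`, so the caller can either discard it or down-weight it, as described above.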

The informatics steps of the above-described method can be implemented on a computer. In certain embodiments, a general-purpose computer can be configured to a functional arrangement for the methods and programs disclosed herein. The hardware architecture of such a computer is well known by a person skilled in the art, and can comprise hardware components including one or more processors (CPU), a random-access memory (RAM), a read-only memory (ROM), and an internal or external data storage medium (e.g., hard disk drive). A computer system can also comprise one or more graphic boards for processing and outputting graphical information to display means. The above components can be suitably interconnected via a bus inside the computer. The computer can further comprise suitable interfaces for communicating with general-purpose external components such as a monitor, keyboard, mouse, network, etc. In some embodiments, the computer can be capable of parallel processing or can be part of a network configured for parallel or distributive computing to increase the processing power for the present methods and programs. In some embodiments, the program code read out from the storage medium can be written into memory provided in an expanded board inserted in the computer, or an expanded unit connected to the computer, and a CPU or the like provided in the expanded board or expanded unit can actually perform a part or all of the operations according to the instructions of the program code, so as to accomplish the functions described below. In other embodiments, the method can be performed using a cloud computing system. In these embodiments, the data files and the programming can be exported to a cloud computer that runs the program and returns an output to the user.

A system can, in certain embodiments, comprise a computer that includes: a) a central processing unit; b) a main non-volatile storage drive, which can include one or more hard drives, for storing software and data, where the storage drive is controlled by a disk controller; c) a system memory, e.g., high speed random-access memory (RAM), for storing system control programs, data, and application programs, including programs and data loaded from the non-volatile storage drive; system memory can also include read-only memory (ROM); d) a user interface, including one or more input or output devices, such as a mouse, a keypad, and a display; e) an optional network interface card for connecting to any wired or wireless communication network, e.g., a printer; and f) an internal bus for interconnecting the aforementioned elements of the system.

The memory of a computer system can be any device that can store information for retrieval by a processor, and can include magnetic or optical devices, or solid state memory devices (such as volatile or non-volatile RAM). A memory or memory unit can have more than one physical memory device of the same or different types (for example, a memory can have multiple memory devices such as multiple drives, cards, or multiple solid state memory devices or some combination of the same). With respect to computer readable media, “permanent memory” refers to memory that is permanent. Permanent memory is not erased by termination of the electrical supply to a computer or processor. Computer hard-drive ROM (i.e., ROM not used as virtual memory), CD-ROM, floppy disk and DVD are all examples of permanent memory. Random Access Memory (RAM) is an example of non-permanent (i.e., volatile) memory. A file in permanent memory can be editable and re-writable.

Operation of the computer is controlled primarily by an operating system, which is executed by the central processing unit. The operating system can be stored in a system memory. In some embodiments, the operating system includes a file system. In addition to an operating system, one possible implementation of the system memory includes a variety of programming files and data files for implementing the method described below. In certain cases, the programming can contain a program, where the program can be composed of various modules, and a user interface module that permits a user to manually select or change the inputs to or the parameters used by the program. The data files can include various inputs for the program.

In certain embodiments, instructions in accordance with the method described herein can be coded onto a computer-readable medium in the form of “programming,” where the term “computer readable medium” as used herein refers to any storage or transmission medium that participates in providing instructions and/or data to a computer for execution and/or processing. Examples of storage media include a floppy disk, hard disk, optical disk, magneto-optical disk, CD-ROM, CD-R, magnetic tape, non-volatile memory card, ROM, DVD-ROM, Blu-ray disk, solid state disk, and network attached storage (NAS), whether or not such devices are internal or external to the computer. A file containing information can be “stored” on computer readable medium, where “storing” means recording information such that it is accessible and retrievable at a later date by a computer.

The computer-implemented method described herein can be executed using programs that can be written in one or more of any number of computer programming languages. Such languages include, for example, Java (Sun Microsystems, Inc., Santa Clara, Calif.), Visual Basic (Microsoft Corp., Redmond, Wash.), and C++ (AT&T Corp., Bedminster, N.J.), as well as many others.

In any embodiment, data can be forwarded to a “remote location,” where “remote location” means a location other than the location at which the program is executed. For example, a remote location could be another location (e.g., office, lab, etc.) in the same city, another location in a different city, another location in a different state, another location in a different country, etc. As such, when one item is indicated as being “remote” from another, what is meant is that the two items can be in the same room but separated, or at least in different rooms or different buildings, and can be at least one mile, ten miles, or at least one hundred miles apart. “Communicating” information references transmitting the data representing that information as electrical signals over a suitable communication channel (e.g., a private or public network). “Forwarding” an item refers to any means of getting that item from one location to the next, whether by physically transporting that item or otherwise (where that is possible) and includes, at least in the case of data, physically transporting a medium carrying the data or communicating the data. Examples of communicating media include radio or infra-red transmission channels as well as a network connection to another computer or networked device, and the internet, including email transmissions and information recorded on websites and the like.

Some embodiments include implementation on a single computer, or across a network of computers, or across networks of networks of computers, for example, across a network cloud, across a local area network, on hand-held computer devices, etc. In certain embodiments, one or more of the steps described herein are implemented in one or more computer programs. Such computer programs execute one or more of the steps described herein. In some embodiments, implementations of the subject method include various data structures, categories, and modifiers described herein, encoded on computer-readable medium(s) and transmissible over communications network(s).

Software, web, internet, cloud, or other storage and computer network implementations of the present invention could be accomplished with standard programming techniques to conduct the various assigning, calculating, identifying, scoring, accessing, generating or discarding steps.

The following patent publications are incorporated by reference for all purposes, particularly for methods by which nucleic acid molecules may be manipulated, reagents for doing the same, for sequencing library preparation workflow, sequencing methods, data processing methods, and for definitions of certain terms: U.S. Pat. No. 8,481,292, WO2013128281, and Casbon (Nuc. Acids Res. 2011, 22 e81), US20150044678, US20120122737, U.S. Pat. No. 8,476,018 and all references cited above and below.

Kits

Also provided by this disclosure is a kit for practicing the subject method, as described above. A subject kit may contain at least: a) an asymmetric adaptor; and b) a plurality (e.g., at least 4-96 or more) of pairs of primers, wherein both primers in each pair comprise a 3′ end that is the same as or complementary to a sequence in the adaptor and wherein at least one of the primers in each pair comprises a barcode of, e.g., 2-30 nucleotides in its 5′ tail that distinguishes that primer from other primers. The various components of the kit may be present in separate containers or certain compatible components may be pre-combined into a single container, as desired.

In addition to the above-mentioned components, the subject kits may further include instructions for using the components of the kit to practice the subject methods, i.e., to provide instructions for sample analysis. The instructions for practicing the subject methods are generally recorded on a suitable recording medium. For example, the instructions may be printed on a substrate, such as paper or plastic, etc. As such, the instructions may be present in the kits as a package insert, in the labeling of the container of the kit or components thereof (i.e., associated with the packaging or subpackaging) etc. In other embodiments, the instructions are present as an electronic storage data file present on a suitable computer readable storage medium, e.g., CD-ROM, diskette, etc. In yet other embodiments, the actual instructions are not present in the kit, but means for obtaining the instructions from a remote source, e.g., via the internet, are provided. An example of this embodiment is a kit that includes a web address where the instructions can be viewed and/or from which the instructions can be downloaded. As with the instructions, this means for obtaining the instructions is recorded on a suitable substrate.

Utility

As would be readily apparent, the method described above may be employed to analyze any type of sample, including, but not limited to, samples that contain heritable mutations, samples that contain somatic mutations, samples from mosaic individuals, pregnant females (in which some of the sample contains DNA from a developing fetus), and samples that contain a mixture of DNA from different sources. In certain embodiments, the method may be used to identify a minority variant that, in some cases, may be due to a somatic mutation in a person.

In some embodiments, the method may be employed to detect an oncogenic mutation (which may be a somatic mutation) in, e.g., PIK3CA, NRAS, KRAS, JAK2, HRAS, FGFR3, FGFR1, EGFR, CDK4, BRAF, RET, PDGFRA, KIT or ERBB2, which may be associated with breast cancer, melanoma, renal cancer, endometrial cancer, ovarian cancer, pancreatic cancer, leukemia, colorectal cancer, prostate cancer, mesothelioma, glioma, medulloblastoma, polycythemia, lymphoma, sarcoma or multiple myeloma (see, e.g., Chial 2008 Proto-oncogenes to oncogenes to cancer. Nature Education 1:1). Other oncogenic mutations (which may be somatic mutations) of interest include mutations in, e.g., APC, AXIN2, CDH1, GPC3, CYLD, EXT1, EXT2, PTCH, SUFU, FH, SDHB, SDHC, SDHD, VHL, TP53, WT1, STK11/LKB1, PTEN, TSC1, TSC2, CDKN2A, CDK4, RB1, NF1, BMPR1A, MEN1, SMAD4, BHD, HRPT2, NF2, MUTYH, ATM, BLM, BRCA1, BRCA2, FANCA, FANCC, FANCD2, FANCE, FANCF, FANCG, NBS1, RECQL4, WRN, MSH2, MLH1, MSH6, PMS2, XPA, XPC, ERCC2-5, DDB2 or MET, which may be associated with colon, thyroid, parathyroid, pituitary, islet cell, stomach, intestinal, embryonal, bone, renal, breast, brain, ovarian, pancreatic, uterine, eye, hair follicle, blood or uterus cancers, pilotrichomas, medulloblastomas, leiomyomas, paragangliomas, pheochromocytomas, hamartomas, gliomas, fibromas, neuromas, lymphomas or melanomas. In some embodiments, the method may be employed to detect a somatic mutation in genes that are implicated in cancer, e.g., CTNNB1, BCL2, TNFRSF6/FAS, BAX, FBXW7/CDC4, GLI, HPVE6, MDM2, NOTCH1, AKT2, FOXO1A, FOXO3A, CCND1, HPVE7, TAL1, TFE3, ABL1, ALK, EPHB2, FES, FGFR2, FLT3, FLT4, KRAS2, NTRK1, NTRK3, PDGFB, PDGFRB, EWSR1, RUNX1, SMAD2, TGFBR1, TGFBR2, BCL6, EVI1, HMGA2, HOXA9, HOXA11, HOXA13, HOXC13, HOXD11, HOXD13, HOX11, HOX11L2, MAP2K4, MLL, MYC, MYCN, MYCL1, PTPN1, PTPN11, RARA, SS18 (see, e.g., Vogelstein and Kinzler 2004 Cancer genes and the pathways they control. Nature Medicine 10:789-799). The method may be employed to detect any somatic mutation that is implicated in cancer which is catalogued by COSMIC (Catalogue of Somatic Mutations in Cancer), data of which can be accessed on the internet.

Other mutations of interest include mutations in, e.g., ARID1A, ARID1B, SMARCA4, SMARCB1, SMARCE1, AKT1, ACTB/ACTG1, CHD7, ANKRD11, SETBP1, MLL2, ASXL1, which may be at least associated with rare syndromes such as Coffin-Siris syndrome, Proteus syndrome, Baraitser-Winter syndrome, CHARGE syndrome, KBG syndrome, Schinzel-Giedion syndrome, Kabuki syndrome or Bohring-Opitz syndrome (see, e.g., Veltman and Brunner 2012 De novo mutations in human genetic disease. Nature Reviews Genetics 13:565-575). Hence, the method may be employed to detect a mutation in those genes.

In other embodiments, the method may be employed to detect a mutation in genes that are implicated in a variety of neurodevelopmental disorders, e.g., KAT6B, THRA, EZH2, SRCAP, CSF1R, TRPV3, DNMT1, EFTUD2, SMAD4, LIS1, DCX, which may be associated with Ohdo syndrome, hypothyroidism, Genitopatellar syndrome, Weaver syndrome, Floating-Harbor syndrome, hereditary diffuse leukoencephalopathy with spheroids, Olmsted syndrome, ADCA-DN (autosomal-dominant cerebellar ataxia, deafness and narcolepsy), mandibulofacial dysostosis with microcephaly or Myhre syndrome (see, e.g., Ku et al 2012 A new paradigm emerges from study of de novo mutations in the context of neurodevelopmental disease. Molecular Psychiatry 18:141-153). The method may also be employed to detect a somatic mutation in genes that are implicated in a variety of neurological and neurodegenerative disorders, e.g., SCN1A, MECP2, IKBKG/NEMO or PRNP (see, e.g., Poduri et al 2013 Somatic mutation, genetic variation, and neurological disease. Science 341(6141):1237758).

In some embodiments, a sample may be collected from a patient at a first location, e.g., in a clinical setting such as in a hospital or at a doctor's office, and the sample may be forwarded to a second location, e.g., a laboratory where it is processed and the above-described method is performed to generate a report. A “report” as described herein is an electronic or tangible document which includes report elements that provide test results that may indicate the presence and/or quantity of minority variant(s) in the sample. Once generated, the report may be forwarded to another location (which may be the same location as the first location), where it may be interpreted by a health professional (e.g., a clinician, a laboratory technician, or a physician such as an oncologist, surgeon, pathologist or virologist), as part of a clinical decision.

EXAMPLES

Although the foregoing invention has been described in some detail by way of illustration and example for purposes of clarity of understanding, it is readily apparent to those of ordinary skill in the art in light of the teachings of this invention that certain changes and modifications may be made thereto without departing from the spirit or scope of the appended claims.

Accordingly, the preceding merely illustrates the principles of the invention. It will be appreciated that those skilled in the art will be able to devise various arrangements which, although not explicitly described or shown herein, embody the principles of the invention and are included within its spirit and scope. Furthermore, all examples and conditional language recited herein are principally intended to aid the reader in understanding the principles of the invention and the concepts contributed by the inventors to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions. Moreover, all statements herein reciting principles, aspects, and embodiments of the invention as well as specific examples thereof, are intended to encompass both structural and functional equivalents thereof. Additionally, it is intended that such equivalents include both currently known equivalents and equivalents developed in the future, i.e., any elements developed that perform the same function, regardless of structure. The scope of the present invention, therefore, is not intended to be limited to the exemplary embodiments shown and described herein. Rather, the scope and spirit of the present invention is embodied by the appended claims.

Theoretical Background

Different library preparation methods and NGS chemistries have their own systematic error profiles. It is possible to correct systematic errors, in part at least, that are correlated with known parameters including sequencing cycle-number, strand, sequence-context and base substitution probabilities (Meacham et al BMC Bioinformatics. 2011 12: 451). Random errors can be mitigated by replicate sequencing. However, neither method is sufficient if attempting to detect, with high specificity, a minority variant with a frequency of ≤~3%. To improve error-detection and correction, several groups have used a repeat-code approach. The idea is to sequence copies of the same molecule. The copies are then aligned and a majority-vote used to generate a consensus sequence, which removes most of the errors. In addition, differences between each copy and the consensus sequences can be used to build up an error-model of a specific genomic region, which can then be applied to clean up consensus sequences. For example, see Shugay et al 2014 (Towards error-free profiling of immune repertoires. Nature Methods 11, 653-655 (2014)).
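
As a rough illustration of the repeat-code idea, the following Python sketch takes a column-wise majority vote over copies of one molecule and tallies where each copy disagrees with the consensus. It assumes the copies are already aligned and of equal length, ignores base qualities, and resolves ties in favour of the base seen first; the function names are hypothetical and this is not the procedure of any cited publication.

```python
from collections import Counter


def majority_consensus(copies):
    """Column-wise majority vote over pre-aligned, equal-length copies."""
    if len({len(c) for c in copies}) != 1:
        raise ValueError("copies must be non-empty and of equal length")
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*copies))


def mismatch_profile(copies, consensus):
    """Count (position, consensus_base, observed_base) disagreements.

    Tallies like this, accumulated over many read groups, are the raw
    material for a position-specific error model.
    """
    return Counter(
        (i, consensus[i], copy[i])
        for copy in copies
        for i in range(len(consensus))
        if copy[i] != consensus[i]
    )


copies = ["ACGTAC", "ACGTAC", "ACCTAC"]  # third copy carries one error
consensus = majority_consensus(copies)
profile = mismatch_profile(copies, consensus)
```

A majority vote of this kind cannot distinguish a genuine variant from an error that arose early enough to dominate the copies, which is one reason the comparison of both strands described in this disclosure is useful.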

Four main methods can be applied to identify copies of the same molecule: (1) fragmentation breakpoints; (2) orientation of a DNA insert compared to surrounding adaptor sequences; (3) DNA barcodes and (4) physical separation.

If, by chance, two molecules have the same 5′ and 3′ breakpoints then we might incorrectly group reads and attempt to generate a consensus sequence. This could result in a false negative if a variant base was not called as part of the consensus sequence (e.g., if there were a greater number of, or higher quality, reads including a non-variant base). Additionally, our error model could be tricked into assuming that a variant base position was error-prone when in fact genuine calls are mixed from two, or more, different molecules. To mitigate this effect Newman et al (2014) classified unique molecules as those with unique 5′ and 3′ breakpoints and 100% sequence identity, ignoring low quality base calls. This reduces false negative variants but is likely to increase false-positive calls owing to grouping of reads with errors. Newman et al (2014) appear aware of this deficiency as they discuss implementing molecular tagging approaches to improve data quality.

Cell-free or circulating tumour DNA (ctDNA) is tumour DNA circulating freely in the blood of a cancer patient. Protocols to extract ctDNA generally aim to reduce contamination with normal DNA from leukocytes. This is achieved by rapid processing of whole blood by centrifugation to remove all cells, and analysis of the remaining plasma. ctDNA is highly fragmented, with a mean fragment size of ~165 bp. Newman et al (Nat Med. 2014 20: 548-54) made libraries from 7-32 ng ctDNA isolated from 1-5 mL plasma. This is equivalent to 2,121-9,697 haploid genomes (assuming 3.3 pg per haploid genome). This range represents the maximum number of unique molecules that can be captured and sequenced. In practice the maximum number of molecules that can be captured is reduced by random fragmentation of regions covered by bait targets and inefficiencies during library preparation and target recovery.
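
The conversion from input mass to haploid genome equivalents used above is simple arithmetic; a short Python sketch follows. The function name is hypothetical, and the 3.3 pg per haploid human genome figure is the assumption stated in the text.

```python
def haploid_genome_equivalents(mass_ng, pg_per_haploid_genome=3.3):
    """Convert a DNA mass in nanograms to haploid genome equivalents.

    3.3 pg per haploid human genome is the value assumed in the text.
    """
    return mass_ng * 1000.0 / pg_per_haploid_genome  # 1 ng = 1,000 pg


low = round(haploid_genome_equivalents(7))    # lower end of the 7-32 ng range
high = round(haploid_genome_equivalents(32))  # upper end of the 7-32 ng range
```

Because each haploid genome equivalent contributes at most one copy of a given locus, this number is an upper bound on the unique molecules available at any one target.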

To estimate the frequency of molecules with identical 5′ and 3′ breakpoints, one can make several simplifying assumptions: (1) breakpoints are randomly distributed; (2) to be captured a fragment must have 100% match to an RNA bait; and (3) the library has a fixed size range of 120-165 bp. For example, imagine a fragment where a 5′ breakpoint maps 25 bp upstream of the RNA bait. To be within the fixed library size range, the 5′ breakpoint could be associated with any of 21 different 3′ breakpoints (0 to 20 bp downstream of the RNA bait). The same calculation can be performed for each of the possible 5′ and 3′ breakpoints that generate fragments within the size range. The total number of breakpoint combinations can be calculated using $\sum_{i=0}^{d} (i+1)$, where d is the difference in length between the maximum fragment length and the RNA bait length. In the above example, d=45, giving 1,081 breakpoint combinations. Next, we can estimate the number of duplicate molecules using collision theory. The expected total number of times a selection will repeat a previous selection as x integers are chosen from a list of y integers (1, y) equals:
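
The count of 1,081 can be checked by brute-force enumeration. The Python sketch below lists every (upstream, downstream) overhang pair for a fragment that fully contains the bait and does not exceed the maximum fragment length. The 120 bp bait length is an assumption implied by d=45 and the 165 bp maximum; it is not stated explicitly in the text, and the function name is hypothetical.

```python
def breakpoint_pairs(bait_len=120, max_fragment_len=165):
    """Enumerate (upstream, downstream) overhangs around a fully covered bait.

    A fragment is kept if bait_len + upstream + downstream <= max_fragment_len.
    """
    d = max_fragment_len - bait_len
    return [(up, down) for up in range(d + 1) for down in range(d - up + 1)]


pairs = breakpoint_pairs()
closed_form = sum(i + 1 for i in range(45 + 1))  # the summation with d = 45
```

Both the enumeration and the closed-form sum give 1,081 combinations under these assumptions.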

$x - y + y\left( \frac{y - 1}{y} \right)^{x}.$

In our case, this can be paraphrased as the expected number of captured molecules, out of x captured molecules, that will have the same two breakpoints as another captured molecule, where y is the number of different breakpoint combinations in the library. For example, if x=1,000 and y=1,081 then 347 captured molecules are expected to have the same breakpoints as another captured molecule. In practice, the number of molecules that cannot be uniquely identified is likely higher than 347 because some of the 1,081 breakpoint combinations are likely to be observed more often than others, owing to the distribution of fragment sizes around a mean length and biases in fragmentation breakpoints. This suggests that one needs information in addition to the fragmentation breakpoints in order to uniquely identify molecules for error-correction.
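
The collision estimate can be evaluated directly and cross-checked against a small simulation. The Python sketch below is illustrative only: the function names, trial count and random seed are arbitrary choices, and the simulation assumes uniformly distributed breakpoint combinations, which (as noted above) is optimistic.

```python
import random


def expected_repeats(x, y):
    """Expected number of draws that repeat an earlier draw: x - y + y((y-1)/y)^x."""
    return x - y + y * ((y - 1) / y) ** x


def simulated_repeats(x, y, trials=200, seed=0):
    """Monte Carlo check: draw x of y equally likely combinations, count repeats."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        distinct = {rng.randrange(y) for _ in range(x)}
        total += x - len(distinct)  # every draw beyond the first of its kind
    return total / trials


formula = expected_repeats(1000, 1081)
simulated = simulated_repeats(1000, 1081)
```

The formula follows from the expected number of distinct combinations drawn, y(1 − ((y−1)/y)^x): subtracting that from x leaves the draws that repeat an earlier one. The simulated mean should land close to the formula value, which rounds to 347 for x=1,000 and y=1,081.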

DNA barcodes can be used on their own, or in addition to fragmentation breakpoints, to help identify duplicate molecules. There is an important distinction between methods where a pool of DNA barcodes, with different sequences, is attached en masse and split-pool methods where individual DNA barcodes are attached in separate reactions before pooling. If barcodes are attached en masse, they must be added to template DNA before amplification (and if they are attached en masse by extension of a barcoded primer, great care must be taken to ensure that residual unextended primers are removed before amplification). These restrictions are not necessary if using a split-pool method, where sub-sample barcodes can be added before, during, or after amplification. In addition, en masse tagging usually requires that the pool of barcode sequences is carefully balanced before tagging. If barcodes are pooled from individual synthesis reactions then care must be taken not to over- or under-represent individual barcode sequences. Alternatively, if barcodes are degenerate then there can be bias. This restriction does not occur in split-pool methods where each sub-sample is associated with one, or a pair, of barcode sequences. FIG. 3 shows a split-pool method.

How many separate sub-samples are required? This depends on the probability that two, or more, fragments in the initial library have the same 5′ and 3′ breakpoints. In the earlier worked example, with x=1,000 captured molecules and y=1,081 different breakpoint combinations in the library, 347 captured molecules were expected to have the same breakpoints as another captured molecule. If we split the reaction over eight sub-samples then x=125 (1,000/8) molecules per sub-sample. We can then use the collision formula with x=125 and y=1,081 to estimate that in each sub-sample ~7 tagged molecules are expected to have the same breakpoints, and barcode sequence, as another molecule. If we pool molecules from each sub-sample into a single pool then 56 (7*8) molecules are expected to have the same breakpoints, and barcode sequence, as another molecule. This is an approximately 6-fold (347/56) reduction in duplicates compared with not splitting, tagging and pooling. Similarly, if we use 16 wells then x=63 (1,000/16, rounded up) and y=1,081, and the estimated number of duplicates is ~2 per sub-sample, or ~32 (2*16) after pooling, an approximately 11-fold (347/32) reduction in duplicates. This suggests that separate reactions can be set up in a 96-well plate, rather than requiring specialist microfluidics.
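
The effect of the number of sub-samples can be explored with a short Python sketch that applies the collision formula per sub-sample and then multiplies by the number of sub-samples, as in the eight-sub-sample calculation above. The function names are hypothetical, the per-sub-sample molecule count is rounded up, and uniformly distributed breakpoint combinations are assumed.

```python
import math


def expected_repeats(x, y):
    """Collision formula: x - y + y((y-1)/y)^x."""
    return x - y + y * ((y - 1) / y) ** x


def split_pool_estimate(total_molecules, n_sub_samples, y):
    """Return (repeats per sub-sample, repeats after pooling all sub-samples)."""
    per_sub_sample_x = math.ceil(total_molecules / n_sub_samples)
    per_sub_sample = expected_repeats(per_sub_sample_x, y)
    return per_sub_sample, per_sub_sample * n_sub_samples


unsplit = expected_repeats(1000, 1081)
for n in (8, 16, 96):
    per_sub_sample, pooled = split_pool_estimate(1000, n, 1081)
    print(f"{n:>2} sub-samples: {per_sub_sample:.2f} per sub-sample, "
          f"{pooled:.2f} pooled, {unsplit / pooled:.1f}-fold reduction")
```

The sketch works with unrounded expectations, so its pooled figures differ slightly from hand calculations that round the per-sub-sample estimate to a whole number before multiplying.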

Physical separation methods include clonal amplification in a microdroplet or on a solid surface (e.g., an Illumina flow-cell). Microdroplet methods generally rely on limiting dilution of template where each microdroplet contains only a fraction of the total genome (digital PCR, RainDance Technologies' Thunderstorm platform for target enrichment or 10x Genomics GemCode system). However, current physical methods require complex microfluidics.

Example 1

If double-stranded DNA is tagged with Y-shaped asymmetric adaptors then each strand has a different insert orientation relative to read 1 and 2 adaptor sequences (FIG. 2). Hybridization capture using Agilent SureSelect is specific to one strand. If a non-amplified library is captured then sequencing data derives from only one of the two strands. To derive sequencing data from both strands, a single primer-extension or PCR step is used before capture (see FIG. 2).

In one embodiment, ctDNA is end-repaired, 3′ adenylated and ligated to generic Y-shaped adaptors. In the Agilent SureSelect^(XT2) Target Enrichment System for Illumina Paired-End Multiplex Sequencing protocol, 5-8 cycles of PCR are used to bulk amplify the library. Adaptors, or PCR primers, include sample-specific barcodes for multiplexing prior to hybridization capture. Instead, in the approach described here, one library is split into multiple separate PCR reactions or sub-samples. Each PCR reaction is labelled with barcode(s) using one, or both, primers. After PCR, the separate reactions are pooled together and used as template in a hybridization capture. After capture, and washing, libraries are bulk PCR amplified before paired-end NGS. Fragments are identified using 5′ and 3′ breakpoints in addition to barcode sequence(s) (see FIGS. 2 and 3).

Example 2

In a second embodiment, ctDNA is end-repaired, 3′ adenylated and ligated to generic Y-shaped adaptors. The library is split into multiple separate reactions. In each reaction the library is amplified by PCR using generic PCR primers. After amplification, barcodes are ligated to the PCR amplicons. After ligation, the separate reactions are pooled together and used as template in a hybridization capture. After capture, and washing, libraries are bulk PCR amplified before paired-end NGS. Fragments are identified using 5′ and 3′ breakpoints in addition to barcode sequence(s) (see FIGS. 2 and 3).

1-21. (canceled)
 22. A method for determining if a potential sequencevariation is in the top and bottom strands of the same DNA fragment,comprising: (a) ligating an asymmetric adaptor to a sample thatcomprises fragments of DNA, to produce an adaptor-ligated sample; (b)splitting the adaptor-ligated sample into a plurality of sub-samples,wherein each of the sub-samples is placed in one container and thedifferent sub-samples are placed in separate containers; (c) separatelytagging each of the different sub-samples with one of a plurality ofsequence tags that identify the different sub-samples of step (b) bypolymerase chain reaction (PCR) using primers that have a 5′ tail thathas a sub-sample identifier sequence, wherein the 5′ tail does nothybridize to the adaptor-ligated sample, to produce tagged sub-samples;(d) sequencing polynucleotides from each of the tagged sub-samples ofstep (c), or copies of the same, to produce sequence reads, each of thesequence reads comprising: (i) one of the plurality of sequence tagsthat identify the different sub-samples and (ii) the sequence of atleast part of a fragment from one of the sub-samples, wherein some ofthe sequence reads of step (d) are derived from the top strand of thefragment from one of the subsamples and some of the sequence reads ofstep (d) are derived from the bottom strand of the same fragment. 23.The method of claim 22, wherein the method comprises grouping thesequence reads of step (d), wherein sequence reads that have identicalsequences, identical fragmentation breakpoints and the same sub-sampleidentifier sequence are placed in a group.
 24. The method of claim 22, wherein the method comprises: pooling the tagged sub-samples of step (c) prior to step (d) to produce a pooled sample, and wherein step (d) comprises sequencing nucleic acids in the pooled sample.
 25. The method of claim 22, wherein the polynucleotides sequenced in (d) are selected by target enrichment.
 26. The method of claim 25, wherein the target enrichment is done by polymerase chain reaction.
 27. The method of claim 22, wherein the asymmetric adaptor is a Y adaptor.
 28. The method of claim 22, wherein the asymmetric adaptor comprises a sample identifier sequence that identifies the sample to which the asymmetric adaptor is added, wherein the polynucleotides from each of the tagged sub-samples each comprise one of the sequence tags that identify the sub-samples and the sample identifier sequence, and wherein the sequencing reads of step (d) further comprise the sample identifier sequence.
 29. The method of claim 22, wherein one or more of the sequence tags used in step (c) contain a sample identifier sequence that identifies the sample of step (a), wherein the polynucleotides from each of the tagged sub-samples of step (c) each comprise one of the sequence tags that identify the sub-samples and the sample identifier sequence, and wherein the sequencing reads of step (d) further comprise the sample identifier sequence.
 30. The method of claim 22, wherein, in step (b), the adaptor-ligated sample is split into at least 4 sub-samples.
 32. The method of claim 22, wherein the sample of step (a) comprises fragments of human genomic DNA.
 33. The method of claim 22, wherein the sample of step (a) is obtained from a cancer patient.
 34. The method of claim 22, further identifying a minority variant sequence in the sequence reads.
 35. The method of claim 34, wherein the minority variant is a somatic mutation.
 36. The method of claim 22, wherein the PCR of step (c) is composed of 4 to cycles.
 37. The method of claim 22, further comprising: (e) identifying a potential sequence variation in a group of the sequence reads of step (d), wherein the group of sequence reads correspond to the top strand of the fragment from one of the sub-samples; and (f) determining if the potential sequence variation is in a group of the sequence reads of step (d) that corresponds to the bottom strand of the same fragment.